03 - Logistic Regression
University of San Francisco, MSMI-608
Pre-Class Code Assignment Instructions
In this semester, I am going to ask you to do a fair bit of work before coming to class. This will make our class time shorter, more manageable, and hopefully less boring.
I am also going to use this as an opportunity for you to directly earn grade points for your effort/labor, rather than “getting things right” on an exam.
Therefore, I will ask you to work through the posted slides on Canvas before class. Throughout the slides, I will post Pre-Class Questions for you to work through in R
. These will look like this:
Pre-Class Q1
In R
, please write code that will read in the .csv
from Canvas called sf_listings_2312.csv
. Assign this the name bnb
.
You will then write your answer in a .r script:
# Q1
bnb <- read.csv("sf_listings_2312.csv")
Important:
To earn full points, you need to organize your code correctly. Specifically, you need to:
- Answer questions in order.
  - If you answer them out of order, just re-arrange the code after.
- Preface each answer with a comment (# Q1 / # Q2 / # Q3) that indicates exactly which question you are answering.
  - Please just write the letter Q and the number in this comment.
- Make sure your code runs on its own, on anyone's computer.
  - To do this, I would always include rm(list = ls()) at the top of every .r script. This will clean everything from the environment, allowing you to check that your script will run on my computer.
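For example, the top of a correctly organized script might look like this (the Q2 answer is just a placeholder):
rm(list = ls())   # Clear the environment so the script runs on its own

# Q1
bnb <- read.csv("sf_listings_2312.csv")

# Q2
# ...your answer to Q2 goes here...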
Handing this in:
- You must submit this to Canvas before 9:00am on the day of class. Even if class starts at 10:00am that day, these are always due at 9:00.
- You must submit this code as a .txt file. This is because Canvas cannot present .R files to me in SpeedGrader. To save as .txt:
  - Click File -> New File -> Text File
  - Copy and paste your completed code to that new text file.
  - Save the file as firstname_lastname_module.txt
    - For example, my file for Module 01 would be matt_meister_01.txt
    - My file for Module 05 would be matt_meister_05.txt
Grading:
- I will grade these for completion.
  - You will receive 1 point for every question you give an honest attempt to answer.
  - Your grade will be the number of questions you answer, divided by the total number of questions.
  - This is why it is important that you number each answer with # Q1. Any questions that are not numbered this way will be graded incomplete, because I can't find them.
- You will receive a 25% penalty for submitting these late.
- I will post my solutions after class.
Setup
Load in these packages. If you do not have them, you will need to install them.
- e.g.,
install.packages("dplyr")
library(stringr)
library(dplyr)
library(ggplot2)
library(tidyr)
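If any of these library() calls fail, install the missing packages first (you only need to do this once per computer), then re-run the library() lines:
install.packages(c("stringr", "dplyr", "ggplot2", "tidyr"))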
Read in airline_satisfaction.csv
from Canvas, and assign it the name data
:
data <- read.csv('airline_satisfaction.csv')
And read in airbnb_logistic.csv
from Canvas, and assign it the name airbnb.data
:
airbnb.data <- read.csv('airbnb_logistic.csv')
Binary Outcomes
Thus far, we’ve spent a lot of time talking about “linear” regression. Linear regression is called linear because it calculates the best equation of a line to fit our data. Effectively–what is the best line we can draw through our data points?
Now, we will introduce logistic regression. Logistic regression is similar to linear regression, in that we are drawing a line through points in our data, which we do to predict some outcome. However, logistic regression is used in situations where our outcome is either a 0 or a 1 (or a yes/no, or a TRUE/FALSE). With a logistic regression, we are predicting the probability of an outcome. (In linear regression, we are predicting the level of an outcome.)
For example, consider last week’s Airbnb data. We might be interested in staying at an Airbnb in North Beach, but terrified that the density of apartments and restaurants–which attracted us in the first place–also means that Airbnbs close to the center of North Beach have a rat problem. Being well-informed statisticians (and newly skilled at text analysis), we could analyze whether reviews are more likely to mention rats if they are closer to the center of North Beach.1
Pre-Class Q1
In the data set airbnb_logistic.csv
, you have observations from reviews of the distance
of an Airbnb from the center of North Beach (in miles), and whether the review mentions rats
. Remember that we can only use a logistic regression on outcomes that are binary. So to start, create a histogram for each variable (distance
and rats
), to check that rats
is binary, and that distance
is not crazy.
hist(airbnb.data$rats)
hist(airbnb.data$distance)
Pre-Class Q2
Now, using a linear regression, test whether distance
is related to rats
. Then, plot this with ggplot()
, using geom_smooth(method = "lm")
.
summary(lm(data = airbnb.data, rats ~ distance))
Call:
lm(formula = rats ~ distance, data = airbnb.data)
Residuals:
Min 1Q Median 3Q Max
-0.49174 -0.21301 -0.02665 0.15364 0.73748
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.56044 0.05812 9.643 7.21e-16 ***
distance -0.10907 0.01374 -7.940 3.40e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2889 on 98 degrees of freedom
Multiple R-squared: 0.3915, Adjusted R-squared: 0.3853
F-statistic: 63.05 on 1 and 98 DF, p-value: 3.399e-12
ggplot(data = airbnb.data,
aes( x = distance,
y = rats)) +
geom_smooth(method = 'lm')
You should see that mention of rats
decreases with distance
, and that this is statistically significant:
Because our DV is either 0 or 1, the coefficient can be interpreted as a change in probability of the outcome with a one-unit change in the IV. And each point on the line shows us the predicted probability of rats
at a given distance
. In cases like this (i.e., with binary dependent variables), linear regressions are also referred to as “linear probability models”. This is good enough to tell us that rats
are more prevalent as we get closer to North Beach. However, there are two oddities you should notice with this relationship:
First, linearity:
A linear regression always predicts straight lines. As a result, it predicts that the relationship between rats
and distance
is perfectly linear. This regression is saying that the likelihood of encountering rats changes by -10.9%2 with each mile further from North Beach.
Sometimes linearity is correct. Sometimes, people’s behavior does change according to some constant relationship with a predictor. But, often, relationships are not linear.
One common example of this in marketing is the relationship between advertising spending and sales. When we increase ad spending, sales also increase. However, this is not linear. When we go from spending $0 to $100, we should see a big bump in sales. But if we go from spending $100,000 to $100,100, it is very unlikely we see any difference at all. This specific case is referred to as “diminishing marginal returns”, and applies to a lot of contexts. You can read more at this link.
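As a toy illustration of diminishing returns (completely made-up numbers, not data from this class), you could plot something like:
ad_spend <- seq(0, 100000, by = 1000)
sales <- 500 * log(ad_spend + 1)    # each additional dollar buys a smaller bump in sales
plot(ad_spend, sales, type = "l", xlab = "Ad spending ($)", ylab = "Sales")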
So in our example, we might be concerned that the relationship between distance
and rats
is also not linear. We’ll talk about when this is a problem (vs when linearity is “fine enough”) later.
Second, impossible predictions:
Look back up at the plot you made with your linear prediction. Remember, rats
is binary. It can be 0 or 1. Therefore, our predictions are probabilities. They should be between 0 and 1–0 corresponding to “no chance”, and 1 corresponding to “certainty”. Let’s test if we satisfy that here.
Pre-Class Q3
Use the predict()
function to predict the probability of rats
at distances of 0, 1, and 6 miles, using your linear regression. Remember that predict()
takes three arguments: (1) An object
(which is our model, in this case a linear regression), (2) newdata
(which is a data.frame
with the exact same variable names as in our regression), and (3) type
(which we will always set to response
in this class).
linear_model <- lm(data = airbnb.data, rats ~ distance)
new_data <- data.frame(distance = c(0, 1, 6))
predict(object = linear_model,
        newdata = new_data,
        type = "response")
1 2 3
0.56044316 0.45137074 -0.09399136
You might have expected this after looking at the plot above. When distance
is relatively close, your predictions are fine–they are between 0 and 1. But, when distance
is 6, your prediction is negative! This happens because a linear regression has to fit a straight line to our data. So, while it fits the best line it can, that linearity can cause it to make impossible predictions–probabilities above 1 and below 0.
So maybe we need something better. In class and in your homework, we will see that, often, linear regression is fine in practice. But before we do that, let’s consider the alternative–logistic regression.
Logistic Regression
Logistic regression models the probability of an outcome using an S-shaped curve called the logistic function. This S-shape ensures that probabilities are between 0 and 1. To run a logistic regression in R, we use a function very similar to lm()
, with two changes. For example, to test our question above using linear regression, we wrote:
lm(data = airbnb.data, rats ~ distance)
But to test our question using logistic regression, we write:
glm(data = airbnb.data, rats ~ distance, family = "binomial")
The two changes are:
- Using glm(), which stands for Generalized Linear Model and allows for different forms of regression.
- Specifying family = "binomial", which tells R that the dependent variable is binary, so it should use the logistic function.
Additionally, when we plot this using ggplot()
, we change the code in just two ways again. Instead of writing geom_smooth(method = 'lm')
, we:
- Change method = 'lm' to method = 'glm'
- Inside of geom_smooth() (after method = 'glm'), we add the argument method.args = "binomial"
  - Note that sometimes you will also have to add the argument formula = y ~ x
Pre-Class Q4
Now, using a logistic regression, test whether distance
is related to rats
. Then, plot this with ggplot()
, using geom_smooth(method = "glm", method.args = "binomial")
.
summary(glm(data = airbnb.data, rats ~ distance, family = "binomial"))
Call:
glm(formula = rats ~ distance, family = "binomial", data = airbnb.data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.8001 0.9715 2.882 0.003949 **
distance -2.5101 0.7139 -3.516 0.000438 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 87.934 on 99 degrees of freedom
Residual deviance: 30.340 on 98 degrees of freedom
AIC: 34.34
Number of Fisher Scoring iterations: 8
ggplot(data = airbnb.data,
aes( x = distance,
y = rats)) +
geom_smooth(method = "glm",
method.args = "binomial")
In this case, the S-shape is backwards, because rats
decrease with distance
. But you should notice that this model tells a very different story, and makes very different predictions from the linear model. This is super obvious if you look at the two plots side-by-side, including the points:
Obviously, the linear regression was not flexible enough to pick up on the fact that this relationship is actually not linear. The difference in rats from 0-1 mile is much larger than the difference from 3-4, or even 3-6! There are also no rats further than 3 miles away, and very few at 2 miles away. The logistic regression can fit itself to all of this information, whereas the linear regression cannot. Ideally, this makes for better predictions.
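If you want to recreate that comparison yourself, one way is to overlay both fits on the same plot (this is a sketch, not my exact figure; the colors are arbitrary):
ggplot(data = airbnb.data, aes(x = distance, y = rats)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_smooth(method = "glm", method.args = "binomial", se = FALSE, color = "blue")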
Pre-Class Q5
Use the predict()
function to predict the probability of rats
at distances of 0, 1, and 6 miles, using your logistic regression. Remember that predict()
takes three arguments: (1) An object
(which is our model, in this case a logistic regression), (2) newdata
(which is a data.frame
with the exact same variable names as in our regression), and (3) type
(which we will always set to response
in this class).
logistic_model <- glm(data = airbnb.data, rats ~ distance, family = "binomial")
new_data <- data.frame(distance = c(0, 1, 6))
predict(object = logistic_model,
        newdata = new_data,
        type = "response") |>
  round(3)
1 2 3
0.943 0.572 0.000
Conveniently, we don’t get any impossible predictions!
Coefficients in a logistic regression
So far, logistic regression has been all good news. Now is when the bad begins. Look at the coefficients we got for that logistic regression:
|             | Estimate | Std. Error | z value | Pr(>\|z\|) |
|-------------|----------|------------|---------|------------|
| (Intercept) | 2.80     | 0.972      | 2.882   | 0.004      |
| distance    | -2.51    | 0.714      | -3.516  | 0.000      |
How is the intercept above 1 if the whole curve is between 0 and 1? And how is the coefficient for distance
-2.51 if a one-unit change in distance clearly does not lead to a 2.51 unit decrease in rats
?
Unfortunately, explaining this will take some math.
The key to understanding these results is to be aware that both the intercept and “slope” in a logistic regression are expressed in log-odds, not probabilities.
- Odds are the ratio of the probability of success (\(p\)) to the probability of failure (\(1-p\)).
  - \(\text{odds} = \frac{p}{1-p}\)
- The log-odds is the natural logarithm of these odds: \(\text{log-odds} = \log\left(\frac{p}{1-p}\right)\).
  - We've seen the natural logarithm before, when we used log() in R.
- Meanwhile, probability is equal to \(\text{p} = \frac{\text{odds}}{1 + \text{odds}}\)
  - Starting with: \(\text{odds} = \frac{p}{1-p}\)
  - Multiply both sides by \(1-p\) to eliminate the denominator: \(\text{odds} \times (1-p) = p\)
  - Distribute \(\text{odds}\): \(\text{odds} - \text{odds} \times p = p\)
  - Rearrange to isolate \(p\): \(\text{odds} = p + \text{odds} \times p\)
  - Factor \(p\) out of the right-hand side: \(\text{odds} = p \times (1 + \text{odds})\)
  - Divide through by \(1 + \text{odds}\): \(p = \frac{\text{odds}}{1 + \text{odds}}\)
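You can verify these conversions with a couple of lines in R (the starting probability here is just an arbitrary example):
p <- 0.75
odds <- p / (1 - p)     # 3: the event is three times as likely to happen as not
log(odds)               # ~1.10, the log-odds
odds / (1 + odds)       # back to 0.75, matching the formula above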
Why is the intercept above 1?
Conveniently, the intercept is still the value of something when the independent variable is 0. But this is not the predicted value of the dependent variable (as it is with linear regression). With logistic regression, everything is in log-odds, including the intercept.
The intercept is the log-odds of the outcome (\(\text{rats} = 1\)) when the independent variable (\(\text{distance}\)) is 0.
In our example
The intercept is 2.8. Remember also that probability equals \(p = \frac{\text{odds}}{1 + \text{odds}}\). And, that log-odds are \(\text{log}(\text{odds})\). That means that when distance
is zero:
\(\text{log}(\text{odds}) = 2.8\)
You can convert log-odds to odds by exponentiation. In R, this is the function exp()
:
\(\text{odds} = e^{2.8}\) \(\text{odds} = 16.44\)
We can sub this into the probability equation:
\(p = \frac{\text{16.44}}{1+\text{16.44}}\)
And then just regular ole’ math:
\(p = \frac{\text{16.44}}{\text{17.44}}\) \(p \approx 0.943\)
When distance
is zero, the predicted probability of a review mentioning a rat is \(\approx\) 94.3%.
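If you'd rather skip the hand calculation, base R's plogis() function converts log-odds to a probability in one step:
plogis(2.8)   # ~0.943, the same probability we just computed by hand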
Note:
This math is not a lot of fun. Here is a function that will take log-odds, and give you probability. Save this function somewhere and source
it when you need to use log-odds.
prob_from_log_odds <- function(log_odds){
  odds <- exp(log_odds)      # Exponentiate first to get odds
  prob <- odds / (1 + odds)  # Then convert odds to probability
  return(prob)
}
Here is how you can use it:
# Provide regression coefficient directly
prob_from_log_odds(logistic_model$coefficients[[1]])
[1] 0.942682
# Input a number
prob_from_log_odds(16.44)
[1] 0.9999999
How can the coefficient for distance
be -2.51 if a one-unit change doesn’t lead to a 2.51 unit decrease in rats
?
This happens because, again, the coefficient is expressed in log-odds, not the probability scale. A coefficient of -2.51 means that for a one-unit increase in distance
, the log-odds of rats
decreases by 2.51.
In terms of odds, the change (expressed by \(\Delta\)) is multiplicative:
\(\Delta \text{log}(\text{odds}) = -2.51\) \(\Delta \text{odds} = e^{-2.51}\) \(\Delta \text{odds} \approx 0.081\)
This means the odds of rats are multiplied by 0.081 for each one-unit increase in distance; that is, they fall to about 8.1% of the prior odds with each additional mile. This multiplicative change is what allows the logistic regression to curve.
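You can see this multiplicative change using the logistic_model object from Q5 (this is just a check, not something you need to hand in):
exp(coef(logistic_model)[["distance"]])   # ~0.081, the factor the odds are multiplied by per mile
exp(2.80 - 2.51) / exp(2.80)              # same idea by hand: odds at distance 1 vs. distance 0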
My suggestion
Logistic regression coefficients are quite confusing if you're not well-versed in this math. This lack of interpretability is their biggest flaw. Instead of trying to interpret the coefficients directly, I would use the predict() function in R to see how your predicted probabilities change at different levels of your independent variable.
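For example, using the logistic_model from Q5 and a grid of distances (the 0-to-3-mile range is an arbitrary choice):
distance_grid <- data.frame(distance = seq(0, 3, by = 0.5))
predict(logistic_model, newdata = distance_grid, type = "response") |>
  round(3)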
What you need to take away from this is not the specifics of the math, but the ideas that:
- Logistic regression provides coefficients in log-odds, not probabilities.
- This keeps the logistic regression's predicted probabilities between 0 and 1 (the curve approaches, but never reaches, those values).
- This also allows the logistic regression function to curve, identifying non-linear relationships between our dependent variable and independent variable.
Airline Satisfaction
The data set airline_satisfaction.csv
on Canvas contains the results of a survey of airline customers. These responses were collected immediately after getting off a plane trip. The airline collected some personal and travel information from each customer, and then asked about their satisfaction with a number of elements of their trip.
In class, we will practice testing logistic regressions, comparing them to linear regressions, and show how we can make predictions of new data with these functions. In the pre-class workshop, I would like you to orient yourself with the data.
Pre-Class Q6
Our dependent variable is satisfaction
. Summarise this variable, and see what (if anything) we have to do in order to use it in a regression.
summary(data$satisfaction)
Length Class Mode
25976 character character
unique(data$satisfaction)
[1] "satisfied" "neutral or dissatisfied"
# This variable is a character, so we are going to have to turn it into 0/1
data$satisfied <- ifelse(data$satisfaction == "satisfied", 1, 0)
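An optional sanity check that the recode lines up with the original variable:
table(data$satisfaction, data$satisfied)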
Pre-Class Q7
What percentage of customers are satisfied?
mean(data$satisfied)
[1] 0.4389821
Pre-Class Q8
Now, let’s get some ideas for a potential hypothesis. You can look at the other variables we have with str()
and summary()
.
str(data)
'data.frame': 25976 obs. of 26 variables:
$ X : int 0 1 2 3 4 5 6 7 8 9 ...
$ id : int 19556 90035 12360 77959 36875 39177 79433 97286 27508 62482 ...
$ Gender : chr "Female" "Female" "Male" "Male" ...
$ Customer.Type : chr "Loyal Customer" "Loyal Customer" "disloyal Customer" "Loyal Customer" ...
$ Age : int 52 36 20 44 49 16 77 43 47 46 ...
$ Type.of.Travel : chr "Business travel" "Business travel" "Business travel" "Business travel" ...
$ Class : chr "Eco" "Business" "Eco" "Business" ...
$ Flight.Distance : int 160 2863 192 3377 1182 311 3987 2556 556 1744 ...
$ Inflight.wifi.service : int 5 1 2 0 2 3 5 2 5 2 ...
$ Departure.Arrival.time.convenient: int 4 1 0 0 3 3 5 2 2 2 ...
$ Ease.of.Online.booking : int 3 3 2 0 4 3 5 2 2 2 ...
$ Gate.location : int 4 1 4 2 3 3 5 2 2 2 ...
$ Food.and.drink : int 3 5 2 3 4 5 3 4 5 3 ...
$ Online.boarding : int 4 4 2 4 1 5 5 4 5 4 ...
$ Seat.comfort : int 3 5 2 4 2 3 5 5 5 4 ...
$ Inflight.entertainment : int 5 4 2 1 2 5 5 4 5 4 ...
$ On.board.service : int 5 4 4 1 2 4 5 4 2 4 ...
$ Leg.room.service : int 5 4 1 1 2 3 5 4 2 4 ...
$ Baggage.handling : int 5 4 3 1 2 1 5 4 5 4 ...
$ Checkin.service : int 2 3 2 3 4 1 4 5 3 5 ...
$ Inflight.service : int 5 4 2 1 2 2 5 4 3 4 ...
$ Cleanliness : int 5 5 2 4 4 5 3 3 5 4 ...
$ Departure.Delay.in.Minutes : int 50 0 0 0 0 0 0 77 1 28 ...
$ Arrival.Delay.in.Minutes : num 44 0 0 6 20 0 0 65 0 14 ...
$ satisfaction : chr "satisfied" "satisfied" "neutral or dissatisfied" "satisfied" ...
$ satisfied : num 1 1 0 1 1 1 1 1 1 1 ...
summary(data)
X id Gender Customer.Type
Min. : 0 Min. : 17 Length:25976 Length:25976
1st Qu.: 6494 1st Qu.: 32170 Class :character Class :character
Median :12988 Median : 65320 Mode :character Mode :character
Mean :12988 Mean : 65006
3rd Qu.:19481 3rd Qu.: 97584
Max. :25975 Max. :129877
Age Type.of.Travel Class Flight.Distance
Min. : 7.00 Length:25976 Length:25976 Min. : 31
1st Qu.:27.00 Class :character Class :character 1st Qu.: 414
Median :40.00 Mode :character Mode :character Median : 849
Mean :39.62 Mean :1194
3rd Qu.:51.00 3rd Qu.:1744
Max. :85.00 Max. :4983
Inflight.wifi.service Departure.Arrival.time.convenient Ease.of.Online.booking
Min. :0.000 Min. :0.000 Min. :0.000
1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
Median :3.000 Median :3.000 Median :3.000
Mean :2.725 Mean :3.047 Mean :2.757
3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :5.000 Max. :5.000 Max. :5.000
Gate.location Food.and.drink Online.boarding Seat.comfort
Min. :1.000 Min. :0.000 Min. :0.000 Min. :1.000
1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
Median :3.000 Median :3.000 Median :4.000 Median :4.000
Mean :2.977 Mean :3.215 Mean :3.262 Mean :3.449
3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
Inflight.entertainment On.board.service Leg.room.service Baggage.handling
Min. :0.000 Min. :0.000 Min. :0.00 Min. :1.000
1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:3.000
Median :4.000 Median :4.000 Median :4.00 Median :4.000
Mean :3.358 Mean :3.386 Mean :3.35 Mean :3.633
3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:5.000
Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
Checkin.service Inflight.service Cleanliness Departure.Delay.in.Minutes
Min. :1.000 Min. :0.000 Min. :0.000 Min. : 0.00
1st Qu.:3.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.: 0.00
Median :3.000 Median :4.000 Median :3.000 Median : 0.00
Mean :3.314 Mean :3.649 Mean :3.286 Mean : 14.31
3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.: 12.00
Max. :5.000 Max. :5.000 Max. :5.000 Max. :1128.00
Arrival.Delay.in.Minutes satisfaction satisfied
Min. : 0.00 Length:25976 Min. :0.000
1st Qu.: 0.00 Class :character 1st Qu.:0.000
Median : 0.00 Mode :character Median :0.000
Mean : 14.74 Mean :0.439
3rd Qu.: 13.00 3rd Qu.:1.000
Max. :1115.00 Max. :1.000
NA's :83
Pre-Class Q9
Create a scatterplot to visualize the relationship between Flight.Distance
and satisfied
.
ggplot(data = data, aes(x = Flight.Distance, y = satisfied)) +
geom_point() +
geom_smooth(method = "glm",
method.args = "binomial") +
coord_cartesian(ylim = c(-.1,1.1))
Pre-Class Q10
It looks like Flight.Distance
might have some outliers. We can see if this is the case with a histogram.
hist(data$Flight.Distance)
Pre-Class Q11
To fix Flight.Distance
, let’s log-transform it with log()
. Assign this to the variable data$log_Flight.Distance
.
data$log_Flight.Distance <- log(data$Flight.Distance)
hist(data$log_Flight.Distance)
If you stop here
You are ready for class. If you want extra practice and full points, try these questions:
Pre-Class Q12
You might have noticed that the logistic regression fit our airbnb.data
nicely, but that the coefficients it gave us were… gross. In class, I will talk about how in practice, a linear regression is often good enough, because our relationships are often close enough to linear. True, strong S-shapes are rare in nature, because they suggest that our outcome suddenly flips/turns on at some level.
However, even when our data do follow an S-shape, we can often still manipulate our variables to make a linear regression make sense. For extra practice and a challenge, I would like you to try manipulating the variable distance
so that a linear regression will make sense and fit the data. Try doing this without looking at my answer first, and submit something of your own creation!
I am eyeballing distance = 1 as a point where the likelihood of rats changes sharply. I'll use this point to split our data.
airbnb.data$distance_bin <- ifelse(airbnb.data$distance > 1, 1, 0)
linear_model_bin <- lm(data = airbnb.data, rats ~ distance_bin)
summary(linear_model_bin)
Call:
lm(formula = rats ~ distance_bin, data = airbnb.data)
Residuals:
Min 1Q Median 3Q Max
-0.86667 -0.03529 -0.03529 -0.03529 0.96471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.86667 0.05611 15.45 <2e-16 ***
distance_bin -0.83137 0.06086 -13.66 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2173 on 98 degrees of freedom
Multiple R-squared: 0.6557, Adjusted R-squared: 0.6522
F-statistic: 186.6 on 1 and 98 DF, p-value: < 2.2e-16
ggplot(data = airbnb.data,
aes( x = distance_bin,
y = rats)) +
geom_point() +
geom_smooth(method = "lm") +
coord_cartesian(ylim = c(-.1,1.1))
new_data <- data.frame(distance = c(0, 1, 6))
new_data$distance_bin <- ifelse(new_data$distance > 1, 1, 0)
predict(object = linear_model_bin,
        newdata = new_data,
        type = "response") |>
  round(3)
1 2 3
0.867 0.867 0.035
If we think there is a linear relationship in some subset of distance, but no relationship in another, we could create a variable that just cuts off distance where we think the relationship stops.
airbnb.data$distance_cut <- ifelse(airbnb.data$distance > 1, 1,
                                   airbnb.data$distance) # Here, we are keeping distances inside of 1
linear_model_cut <- lm(data = airbnb.data, rats ~ distance_cut)
summary(linear_model_cut)
Call:
lm(formula = rats ~ distance_cut, data = airbnb.data)
Residuals:
Min 1Q Median 3Q Max
-0.47527 -0.06613 -0.06613 -0.06613 0.93387
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.17150 0.09542 12.28 <2e-16 ***
distance_cut -1.10536 0.10069 -10.98 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.248 on 98 degrees of freedom
Multiple R-squared: 0.5515, Adjusted R-squared: 0.5469
F-statistic: 120.5 on 1 and 98 DF, p-value: < 2.2e-16
ggplot(data = airbnb.data,
aes( x = distance_cut,
y = rats)) +
geom_point() +
geom_smooth(method = "lm") +
coord_cartesian(ylim = c(-.1,1.1))
new_data <- data.frame(distance = c(0, 1, 6))
new_data$distance_cut <- ifelse(new_data$distance > 1, 1,
                                new_data$distance)
predict(object = linear_model_cut,
        newdata = new_data,
        type = "response") |>
  round(3)
1 2 3
1.171 0.066 0.066
Pre-Class Q13
The whole point of the classes you have with me is learning how to make better predictions. That is especially true for logistic regression, where we are taking something we knew well (linear regression), and adding a lot of complexity.
For that complexity to be worth it, we have to be making better predictions. To see if we are, we can:
- Split our data into training and testing sets
- Train a regression on the training set
- Make predictions on the testing set
- See how close those are to our observations
In this question, I want you to split your airbnb.data into a training set of 80 rows, and a testing set of 20. Then, train a logistic regression on the training set, and see how well it predicts the test data.
set.seed(123)
#Shuffle airbnb.data
airbnb.data <- airbnb.data[sample(1:nrow(airbnb.data), size = nrow(airbnb.data)), ]
train_df <- airbnb.data[1:80, ]
test_df <- airbnb.data[81:100, ]

logistic_model <- glm(data = train_df, rats ~ distance, family = "binomial")
summary(logistic_model)
Call:
glm(formula = rats ~ distance, family = "binomial", data = train_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.0339 1.1768 2.578 0.00993 **
distance -2.5553 0.8314 -3.074 0.00211 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 71.007 on 79 degrees of freedom
Residual deviance: 26.341 on 78 degrees of freedom
AIC: 30.341
Number of Fisher Scoring iterations: 8
test_df$prediction <- predict(logistic_model,
                              newdata = test_df,
                              type = "response")
# MSE
mean((test_df$prediction - test_df$rats)^2)
[1] 0.03454432
# Make a "confusion matrix" by binarizing the DV
test_df$predicted_rats <- ifelse(test_df$prediction >= .5, "Predict Rats", "Predict No Rats")
table(test_df$predicted_rats, test_df$rats)
0 1
Predict No Rats 16 0
Predict Rats 1 3
Pre-Class Q14
Now, train a linear regression on the training set, and see how well it predicts the test data.
linear_model <- lm(data = train_df, rats ~ distance)
summary(linear_model)
Call:
lm(formula = rats ~ distance, data = train_df)
Residuals:
Min 1Q Median 3Q Max
-0.45942 -0.22120 -0.03261 0.15540 0.73726
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.56308 0.06713 8.388 1.66e-12 ***
distance -0.10996 0.01605 -6.853 1.50e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2952 on 78 degrees of freedom
Multiple R-squared: 0.3758, Adjusted R-squared: 0.3678
F-statistic: 46.96 on 1 and 78 DF, p-value: 1.501e-09
test_df$prediction <- predict(linear_model,
                              newdata = test_df,
                              type = "response")
# MSE
mean((test_df$prediction - test_df$rats)^2)
[1] 0.06913223
# Make a "confusion matrix" by binarizing the DV
test_df$predicted_rats <- ifelse(test_df$prediction >= .5, "Predict Rats", "Predict No Rats")
table(test_df$predicted_rats, test_df$rats)
0 1
Predict No Rats 17 0
Predict Rats 0 3
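If you want a single number to summarize each confusion matrix, you can also compute accuracy (the share of test rows classified correctly) right after each model's prediction step. This is optional, not part of the assignment:
mean((test_df$prediction >= .5) == (test_df$rats == 1))   # proportion of correct classifications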
Footnotes
1. In this example, our hypothesis is that being further from North Beach leads to a decrease in the mention of “rats” in reviews. The null hypothesis is that distance from North Beach has no effect on the mention of rats.
2. Because our DV is either 0 or 1, the coefficient can be interpreted as a change in probability of the outcome with a one-unit change in the IV.